The Journal of China Universities of Posts and Telecommunications (English edition), 2012, Vol. 19, Issue (6): 63-72. doi: 10.1016/S1005-8885(11)60319-1

• Networks •

Studying cost-sensitive learning for multi-class imbalance in Internet traffic classification

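Zhen Liu1, Qiong Liu2

  1. South China University of Technology
  2. School of Software, South China University of Technology
  • Received: 2012-05-03 Revised: 2012-09-23 Online: 2012-12-31 Published: 2012-12-14
  • Contact: Qiong Liu E-mail: liuqiong@scut.edu.cn
  • Supported by: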

    This work was supported by the National Basic Research Program of China (2007CB07100, 2007CB07106).

Abstract:

Cost-sensitive learning has been applied to the multi-class imbalance problem in Internet traffic classification with considerable success. However, classification performance on minority classes that account for few bytes remains poor, because existing research focuses only on classes with a large number of bytes. This paper therefore studies the class-dependent misclassification cost. First, the flow rate based cost matrix (FCM) is investigated. Second, a new cost matrix, the weighted cost matrix (WCM), is proposed; it computes a weight for each cost in FCM from the data imbalance degree and the classification accuracy of each class. WCM further improves classification performance on the difficult minority class (the class with more flows but worse classification accuracy). Experimental results on twelve real traffic datasets show that FCM and WCM achieve more than 92% flow g-mean and 80% byte g-mean on average; on a test set collected one year later, WCM is more stable than FCM.

Keywords:

Internet traffic classification, minority class, cost matrix, machine learning
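
The abstract reports flow g-mean and byte g-mean and relies on a class-dependent cost matrix, but gives no formulas. The sketch below is only an illustration under stated assumptions: g-mean is taken to be the geometric mean of per-class recall, byte g-mean is assumed to weight each flow by its byte count, and the cost matrix is applied with the standard minimum-expected-cost rule. The function names `g_mean` and `min_cost_class` are hypothetical, and the construction of FCM and WCM themselves is not reproduced because the abstract does not define it.

```python
import math
from collections import defaultdict


def g_mean(y_true, y_pred, weights=None):
    """Geometric mean of per-class recall (assumed definition of g-mean).

    weights=None counts every flow once (flow g-mean). Passing per-flow
    byte counts gives a byte-weighted recall (assumed byte g-mean).
    """
    if weights is None:
        weights = [1.0] * len(y_true)
    total = defaultdict(float)    # weight of all flows truly in each class
    correct = defaultdict(float)  # weight of correctly classified flows
    for truth, pred, w in zip(y_true, y_pred, weights):
        total[truth] += w
        if truth == pred:
            correct[truth] += w
    recalls = [correct[c] / total[c] for c in total]
    return math.prod(recalls) ** (1.0 / len(recalls))


def min_cost_class(probs, cost):
    """Return the class index with the lowest expected misclassification cost.

    probs[i] is the estimated probability that the true class is i;
    cost[i][j] is the cost of predicting class j when the truth is i.
    """
    n = len(probs)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=expected.__getitem__)


# Toy data (invented for illustration, not taken from the paper).
y_true = ["web", "web", "p2p", "p2p"]
y_pred = ["web", "web", "p2p", "web"]
flow_bytes = [100, 100, 900, 100]

flow_g = g_mean(y_true, y_pred)              # every flow counts equally
byte_g = g_mean(y_true, y_pred, flow_bytes)  # large flows dominate

# Raising the cost of missing class 1 moves the decision toward class 1.
choice = min_cost_class([0.7, 0.3], [[0, 1], [5, 0]])
```

In this toy case the single misclassified p2p flow carries few bytes, so the byte-weighted score is higher than the flow-weighted one; this is the kind of gap between flow and byte performance that the abstract discusses.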
